This notebook is a series of exercises for practicing probabilities, distributions and statistical tests by answering questions about Titanic passenger data. Each section has a header that states the question, followed by the calculations and a written summary of the result.
In [1]:
import numpy as np
import pandas as pd
titanic_data = pd.read_csv('train.csv')
titanic_data.head(5)
Out[1]:
Checking which columns have missing values and need to be filled, using the .info() method.
In [2]:
titanic_data.info()
Filling all NaN ages with the mean of all the ages and confirming with the .info() method. We later compensate for this with code that removes the mean-filled ages again. Each such step is marked with 'COMPENSATION:' and described where it arises.
In [3]:
titanic_data.Age = titanic_data.Age.fillna(np.mean(titanic_data.Age))
In [4]:
titanic_data.info()
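A possible alternative to filtering on the mean value later is to record which ages were missing before filling them. The following is a minimal sketch on a made-up frame; `toy` and `age_was_missing` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 4.0]})

# Record which ages are missing BEFORE filling them
age_was_missing = toy["Age"].isna()

# Fill with the mean of the observed ages, as in the cell above
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Recover only the originally recorded ages, without comparing to the mean
observed_ages = toy.loc[~age_was_missing, "Age"]
print(observed_ages.tolist())
```

Keeping the mask avoids relying on a rounded floating-point comparison against the mean, which could also drop a genuine age that happens to equal it.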
In [5]:
survivors = titanic_data[titanic_data.Survived == 1]
survivor_prob = (len(survivors) / len(titanic_data))
print("There is a {:.1%} chance of survival.".format(survivor_prob))
Our first choice of question is to see the probability that a passenger was male.
In [6]:
male_passenger = titanic_data[titanic_data.Sex == 'male']
prob_male = (len(male_passenger) / len(titanic_data))
print("There is a {:.1%} probability that a passenger was male.".format(prob_male))
Our second choice of question is to find the probability that a survivor embarked at Cherbourg.
In [7]:
c_port = survivors[survivors.Embarked == 'C']
prob_c = (len(c_port) / len(survivors))
print("There is a {:.1%} probability that a survivor was from Cherbourg.".format(prob_c))
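The probabilities above are counts divided by counts. The same quantities can also be sketched as means of boolean columns, as below. The frame `toy` and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the real passenger list
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 1],
    "Embarked": ["C", "S", "S", "C", "Q", "Q"],
})

# P(survived): the mean of a boolean Series is the fraction of True values
survived = toy["Survived"] == 1
p_survived = survived.mean()

# P(embarked at C | survived): restrict to survivors first, then take the mean
p_c_given_survived = (toy.loc[survived, "Embarked"] == "C").mean()

print("P(survived) = {:.3f}".format(p_survived))
print("P(C | survived) = {:.3f}".format(p_c_given_survived))
```

Restricting to survivors before taking the mean is what makes the second value a conditional probability rather than a joint one.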
Plot the distribution of passenger ages. (Bins = 25)
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
all_ages = []
age_mean = np.mean(titanic_data.Age)
# COMPENSATION: skip the mean-filled ages so only recorded ages are plotted.
for k in titanic_data.Age:
    if round(k, 3) != round(age_mean, 3):
        all_ages.append(k)
H, edges = np.histogram(all_ages, bins=25)
ax = plt.subplot(111)
ax.bar(edges[:-1], H / float(sum(H)), width=edges[1] - edges[0], align='edge')
ax.set_xlabel("Passenger Age")
ax.set_ylabel("Fraction of Passengers")
ax.minorticks_on()
plt.show()
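The normalization used for the bar heights above can be checked without plotting. This is a small sketch with made-up ages; `ages` is hypothetical, and only NumPy is used.

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate the normalization
ages = np.array([2.0, 5.0, 21.0, 24.0, 31.0, 35.0, 41.0, 58.0])

# Raw counts per bin and the bin edges (len(edges) == len(counts) + 1)
counts, edges = np.histogram(ages, bins=4)

# Dividing by the total turns counts into relative frequencies that sum to 1
rel_freq = counts / counts.sum()

# Left edges are edges[:-1]; bars drawn from them should be edge-aligned
left_edges = edges[:-1]
bin_width = edges[1] - edges[0]

print(counts, rel_freq, left_edges, bin_width)
```

Because `np.histogram` returns one more edge than it has bins, the last edge is dropped when the left edges are used as bar positions.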
In [9]:
less_than_ten = []
for age in all_ages:
    if age < 10:
        less_than_ten.append(age)
prob_less_than_ten = len(less_than_ten) / len(all_ages)
print("There is a " + str(round(prob_less_than_ten, 3)) + " probability that a passenger was less than 10 years old.")
In [10]:
from scipy.stats import binom
binom.pmf(42, 100, survivor_prob)
Out[10]:
There is a 0.0613 probability that exactly 42 passengers survive out of 100. See above 'Out' for a more precise probability.
In [11]:
1 - binom.cdf(42, 100, survivor_prob)
Out[11]:
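There is a 0.198 probability that more than 42 (that is, at least 43) of those 100 passengers survive, because 1 - binom.cdf(42, ...) excludes the case of exactly 42. Adding the 0.0613 from the previous result gives roughly 0.259 for at least 42. See above 'Out' for a more precise probability.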
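As a hedged aside on the tail calculation above: `1 - binom.cdf(k, n, p)` gives P(X > k), while P(X >= k) needs the cutoff shifted down by one. The sketch below uses a made-up success probability `p`, not the notebook's `survivor_prob`.

```python
from scipy.stats import binom

n, k = 100, 42
p = 0.38  # hypothetical survival probability, for illustration only

# P(X > 42), i.e. 43 or more successes
p_more_than_k = 1 - binom.cdf(k, n, p)

# P(X >= 42): the survival function sf(x) is P(X > x), so use k - 1
p_at_least_k = binom.sf(k - 1, n, p)

# The two differ by exactly the probability of k successes
p_exactly_k = binom.pmf(k, n, p)

print(p_more_than_k, p_at_least_k, p_exactly_k)
```

`binom.sf` is usually preferred over `1 - cdf` because it avoids loss of precision when the tail probability is very small.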
Is there a statistically significant difference between the age of male and female survivors?
In [12]:
from scipy.stats import ttest_ind
# COMPENSATION: exclude the mean-filled ages so only recorded ages are compared.
survivors_male = survivors[(survivors.Sex == 'male') & (survivors.Age.round(3) != round(age_mean, 3))]
survivors_female = survivors[(survivors.Sex == 'female') & (survivors.Age.round(3) != round(age_mean, 3))]
t_stat, p_value = ttest_ind(survivors_male.Age, survivors_female.Age)
print("Results:\n\tt-statistic: %.5f\n\tp-value: %.5f" % (t_stat, p_value))
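By default `ttest_ind` assumes that both groups have equal variance. If that assumption is in doubt, Welch's t-test is a common alternative, selected with `equal_var=False`. The sketch below runs both on randomly generated samples; the group sizes, means and spreads are made up and do not come from the Titanic data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples; sizes, means and spreads are invented for illustration
rng = np.random.default_rng(0)
group_a = rng.normal(loc=27.0, scale=15.0, size=90)
group_b = rng.normal(loc=29.0, scale=14.0, size=190)

# Student's t-test (pooled variance), the default used in the notebook
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test, which does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # significance level, an assumption for this sketch
print("pooled: t=%.5f p=%.5f significant=%s" % (t_pooled, p_pooled, p_pooled < alpha))
print("welch:  t=%.5f p=%.5f significant=%s" % (t_welch, p_welch, p_welch < alpha))
```

A p-value below the chosen `alpha` would indicate a statistically significant difference in means; the threshold itself should be fixed before looking at the result.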
In [13]:
plt.figure(figsize=(10, 4))
opacity = 0.5
plt.hist(survivors_male.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Males")
plt.hist(survivors_female.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Females")
plt.legend()
plt.title("Age Distribution of Female and Male Survivors")
plt.xlabel("Ages")
plt.ylabel("Number of Survivors")
plt.show()
In [14]:
from scipy.stats import ttest_ind
fare_from_q = titanic_data[titanic_data.Embarked == 'Q']
fare_from_c = titanic_data[titanic_data.Embarked == 'C']
t_stat, p_value = ttest_ind(fare_from_q.Fare, fare_from_c.Fare)
print("Results:\n\tt-statistic: %.5f\n\tp-value: %g" % (t_stat, p_value))
There is a statistically significant difference between the fares paid by passengers who embarked at Queenstown and those who embarked at Cherbourg. This is indicated by the p-value, which is less than 0.01.
In [15]:
plt.figure(figsize=(10, 4))
opacity = 0.5
plt.hist(fare_from_q.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Queenstown")
plt.hist(fare_from_c.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Cherbourg")
plt.legend()
plt.title("Fare Distribution for Queenstown and Cherbourg")
plt.xlabel("Fare Price")
plt.ylabel("Number of Passengers")
plt.show()